Quality Control and Pre-processing of Raw Data: trim the raw data and remove low-quality reads to obtain clean data, and perform quality control;
Genome Assembly and Annotation: conduct de novo genome assembly and annotate the assembled genome;
Genome Component Analysis: analyze the components of the assembled genome, including the prediction of coding genes, non-coding RNAs, plasmids, prophages, integrative and conjugative elements (ICEs), insertion sequences (IS), genomic islands (GIs) and CRISPR-Cas systems;
Gene Function Analysis: annotate the sequences of coding genes against different databases, including GO, KEGG, COG, NCBI Taxonomy and several databases related to pathogenicity;
Genotyping: perform genotyping based on the PubMLST database;
Comparative Genomics Analysis: identify the phylogeny of the bacterial genomes to study the evolutionary relationships among different bacterial species or strains.
Statistics of TGS raw data:
Result files:
Preprocessing and quality control are performed on the TGS raw data using Filtlong[1] with the parameters “--min_length 1000 --keep_percent 95”, which discard reads shorter than 1,000 bp and then the lowest-quality 5% of the remaining reads (measured in bases), to obtain clean data. Statistics of TGS clean data:
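The effect of the two thresholds can be pictured with a minimal Python sketch. It is a simplification and not Filtlong's implementation: reads are ranked here by a precomputed mean quality only, whereas Filtlong uses its own read score, and the `Read` tuple layout and the function name `filter_reads` are assumptions made for illustration.

```python
from typing import List, Tuple

# (read_id, sequence, mean_quality); the mean quality is assumed to be precomputed.
Read = Tuple[str, str, float]


def filter_reads(reads: List[Read], min_length: int = 1000,
                 keep_percent: float = 95.0) -> List[Read]:
    """Drop short reads, then keep the best reads up to keep_percent of the bases."""
    # Step 1: apply the length threshold.
    long_enough = [r for r in reads if len(r[1]) >= min_length]
    # Step 2: keep the highest-quality reads until the base budget is reached.
    total_bases = sum(len(r[1]) for r in long_enough)
    target_bases = total_bases * keep_percent / 100.0
    kept: List[Read] = []
    kept_bases = 0
    for read in sorted(long_enough, key=lambda r: r[2], reverse=True):
        if kept_bases >= target_bases:
            break
        kept.append(read)
        kept_bases += len(read[1])
    return kept


# Toy data (invented for illustration only).
example_reads = [
    ("a", "A" * 2000, 15.0),
    ("b", "A" * 500, 20.0),    # shorter than min_length
    ("c", "A" * 1000, 5.0),    # lowest quality among the long reads
    ("d", "A" * 17000, 12.0),
]
kept_ids = [r[0] for r in filter_reads(example_reads)]
```

In the toy data, read "b" fails the length threshold and read "c" falls into the discarded 5% of bases, so only "a" and "d" remain.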
Result files:
Unicycler[2] is used to assemble the complete genome from the long-read TGS clean data, and the chromosome and plasmid sequences are then screened. Statistics of genome assembly:
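Typical assembly statistics can be computed directly from the contig sequences. The sketch below is an illustration under assumptions: the exact set of statistics reported by the pipeline is not stated here, and the function names and the toy sequences are hypothetical.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def assembly_stats(contigs: Dict[str, str]) -> Dict[str, float]:
    """Summarize an assembly given as {contig_name: sequence}."""
    lengths = [len(seq) for seq in contigs.values()]
    total = sum(lengths)
    gc = sum(seq.upper().count("G") + seq.upper().count("C")
             for seq in contigs.values())
    return {
        "contigs": len(lengths),
        "total_length": total,
        "largest_contig": max(lengths, default=0),
        "n50": n50(lengths),
        "gc_percent": round(100.0 * gc / total, 2) if total else 0.0,
    }


# Toy sequences (invented): a 1,000 bp "chromosome" and a 100 bp "plasmid".
toy = {"chromosome": "ACGT" * 250, "plasmid_1": "AATT" * 25}
stats = assembly_stats(toy)
```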
Result files:
Microbial genomes contain many functional regions. In addition to coding regions, there are non-coding regions involved in transcriptional, post-transcriptional, translational and epigenetic regulation, among other functions. Some functional regions are also related to the evolutionary diversity of species.
Bacterial mobile genetic elements (MGEs) are segments of DNA that can move between different positions within a bacterial genome or between different bacterial genomes. These MGEs can carry various genes, including those involved in antibiotic resistance, virulence, and metabolic pathways, and play a critical role in the evolution and adaptation of bacterial populations to changing environments. However, they can also contribute to the spread of antibiotic resistance and virulence genes among bacterial populations, posing a significant threat to public health.
Coding genes, repetitive sequences, non-coding RNAs, mobile genetic elements and other components were predicted using a variety of methods to determine the composition of the target genome.
Prokka[3] is used to predict coding genes from the genome assembly. Statistics of coding gene prediction results:
Plot the distribution of coding gene length:
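The length distribution can be derived from the CDS features of a GFF3 annotation such as the one Prokka writes. The following is a minimal sketch, assuming standard tab-separated GFF3 columns (feature type in column 3, start and end in columns 4 and 5); the bin size of 300 bp and the function names are arbitrary choices for illustration.

```python
from collections import Counter
from typing import List


def cds_lengths(gff_text: str) -> List[int]:
    """Return the nucleotide length of every CDS feature in a GFF3 document."""
    lengths = []
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 5 and fields[2] == "CDS":
            # GFF3 coordinates are 1-based and inclusive.
            lengths.append(int(fields[4]) - int(fields[3]) + 1)
    return lengths


def length_histogram(lengths: List[int], bin_size: int = 300) -> Counter:
    """Count genes per length bin, keyed by the lower edge of each bin."""
    return Counter((length // bin_size) * bin_size for length in lengths)


# Toy annotation (invented for illustration only).
example_gff = (
    "##gff-version 3\n"
    "contig1\tProdigal\tCDS\t1\t300\t.\t+\t0\tID=gene_a\n"
    "contig1\tAragorn\ttRNA\t400\t475\t.\t+\t.\tID=trna_1\n"
    "contig1\tProdigal\tCDS\t500\t1399\t.\t-\t0\tID=gene_b\n"
    "##FASTA\n"
    ">contig1\n"
    "ACGT\n"
)
example_lengths = cds_lengths(example_gff)
example_bins = length_histogram(example_lengths)
```

The resulting counter can be passed to any plotting library to draw the histogram.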
Result files:
Non-coding RNAs (ncRNAs) are a class of RNA molecules that perform multiple biological functions. They do not carry information that is translated into protein, but act directly at the RNA level. In microorganisms, commonly studied ncRNAs include transfer RNA (tRNA), ribosomal RNA (rRNA) and small RNA (sRNA).
Prokka[3] is used to predict tRNA and rRNA. Statistics of ncRNA prediction results:
Result files:
Plasmids, which play a crucial role in bacterial environmental adaptation, are extrachromosomal genetic elements that replicate independently of the chromosome. Because of their potential for mobilization or conjugation, plasmids are important vectors for antimicrobial resistance genes and virulence factors, and have great and growing clinical significance.
The PlasmidFinder database[4] is a manually curated database of plasmid replicons. Plasmid sequences were identified with BLAST by comparing the assembled genome against the PlasmidFinder database.
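A hit table from such a comparison is usually filtered by identity and coverage. The sketch below is illustrative only: it assumes a tab-separated BLAST table with the columns `qseqid sseqid pident length slen`, and the thresholds (95% identity, 60% coverage) are assumptions chosen for the example, not values stated in this report.

```python
from typing import List, Tuple


def plasmid_hits(blast_tsv: str, min_identity: float = 95.0,
                 min_coverage: float = 60.0) -> List[Tuple[str, str, float, float]]:
    """Keep hits that pass the identity and subject-coverage thresholds."""
    hits = []
    for line in blast_tsv.splitlines():
        if not line.strip():
            continue
        qseqid, sseqid, pident, length, slen = line.split("\t")[:5]
        # Coverage: aligned length as a percentage of the replicon (subject) length.
        coverage = 100.0 * int(length) / int(slen)
        if float(pident) >= min_identity and coverage >= min_coverage:
            hits.append((qseqid, sseqid, float(pident), round(coverage, 1)))
    return hits


# Toy hit table (identifiers and values invented for illustration).
example_table = (
    "contig2\treplicon_A\t98.5\t680\t682\n"
    "contig1\treplicon_B\t88.0\t150\t154\n"   # identity too low
    "contig3\treplicon_C\t99.0\t50\t142\n"    # coverage too low
)
example_hits = plasmid_hits(example_table)
```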
The nucleic acid of a temperate phage that is integrated into the host genome is called a prophage; it replicates synchronously with the host bacterial DNA and is passed on to daughter cells at division. The presence of prophage sequences may allow some bacteria to acquire antibiotic resistance, enhance environmental adaptability, improve adhesion or become pathogenic.
Phigaro[5] was used to predict prophages in the genome assembly. Phigaro annotates phage genes using profile hidden Markov models (HMMs) from the pVOGs (prokaryotic Virus Orthologous Groups) database.
Result files:
Integrative and conjugative elements (ICEs) are highly modular mobile genetic elements that are important for the horizontal transfer of antibiotic resistance and virulence factor genes. Integrons capture foreign gene cassettes through site-specific recombination and enable them to be expressed. Integrons can also be located on plasmids or take part in transfer as components of transposons, thereby spreading drug resistance genes.
ICEfinder[6] can quickly detect ICEs in bacterial genome sequences; the element types detected include T4SS-type ICEs, AICEs and IMEs.
Result files:
Bacterial insertion sequences (IS) are the smallest and most abundant autonomous transposable elements in prokaryotic genomes. They are widespread and may occur at high copy number. They play an important role in genome evolution, genome structure and host fitness. Because of their mobility, IS elements act as mutagens that can alter the expression of adjacent genes, affect virulence, change resistance to exogenous compounds or antimicrobials, or modulate metabolic activity.
digIS[7] is based on profile hidden Markov models (pHMMs) built from the catalytic domains of transposases, and performs very well in detecting known IS elements. digIS can also detect distant and putative novel IS elements.
Result files:
A genomic island (GI) is a genomic region, found in some bacteria, phages or plasmids, that has been integrated into the microbial genome by horizontal gene transfer. A genomic island can be related to a variety of biological functions, such as pathogenic mechanisms and organism adaptability. Comparative genomic analysis can be used to study the specificity and functional origins of microorganisms with special functions.
IslandPath-DIMOB[8], a sequence composition-based method, is applied to predict genomic islands. It identifies genomic islands and potential horizontal gene transfer by detecting dinucleotide bias and the presence of mobility genes, such as transposases or integrases.
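The idea of dinucleotide bias can be illustrated with a toy calculation that compares the dinucleotide frequencies of a window with those of the whole genome. This is a simplified sketch, not the statistic used by the tool: the function names and the distance measure (sum of absolute frequency differences) are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from typing import Dict

# All 16 dinucleotides over the DNA alphabet.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_freqs(seq: str) -> Dict[str, float]:
    """Relative frequency of each dinucleotide (overlapping, single strand)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DINUCLEOTIDES)
    if total == 0:
        return {d: 0.0 for d in DINUCLEOTIDES}
    return {d: counts[d] / total for d in DINUCLEOTIDES}


def dinucleotide_bias(window: str, genome: str) -> float:
    """Sum of absolute frequency differences between a window and the genome."""
    w = dinucleotide_freqs(window)
    g = dinucleotide_freqs(genome)
    return sum(abs(w[d] - g[d]) for d in DINUCLEOTIDES)


# A window with the same composition as the genome has zero bias;
# a compositionally atypical window scores higher.
same = dinucleotide_bias("ACGTACGT", "ACGTACGT")
atypical = dinucleotide_bias("AAAA", "ACGT")
```

Windows with a high score that also contain a mobility gene would be the kind of region flagged as a candidate island.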
Result files:
The CRISPR-Cas system is a prokaryotic immune system that defends against invading foreign genetic material, such as phages and foreign plasmids. It can recognize foreign DNA and silence the expression of foreign genes. CRISPR stands for Clustered Regularly Interspaced Short Palindromic Repeats: arrays of short, conserved repeats separated by spacers. Cas genes are located near the CRISPR locus and encode proteins, including nucleases that can cut double-stranded DNA at the target site under the guidance of a guide RNA. The CRISPR array and the Cas proteins together form the CRISPR-Cas system. In this project, CRISPRCasTyper[9] is used to predict CRISPR-Cas.
Result files:
At present, annotations of GO, KEGG, COG and NCBI Taxonomy are performed.
The basic steps of functional annotation are as follows:
The protein sequences of the predicted genes are mapped to the eggNOG database[11] using eggNOG-mapper[10];
Functional annotations from the corresponding databases are assigned based on the mapping results.
Result files:
Directory of result files: results/04.gene_function/*/eggnog-mapper.
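The mapping results can be summarized per database with a few lines of Python. The sketch below assumes a tab-separated annotation table with a `#query` header line, `##` comment lines, `-` for missing values, and columns named `COG_category`, `GOs` and `KEGG_ko`; these names follow the eggNOG-mapper v2 output as far as known here and should be checked against the actual file.

```python
import csv
import io
from collections import Counter


def summarize_annotations(tsv_text: str) -> dict:
    """Count annotated genes per database from an annotation table."""
    # Keep the "#query" header line but drop "##" comment lines.
    lines = [ln for ln in tsv_text.splitlines() if ln and not ln.startswith("##")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    cog_counts = Counter()
    total = with_go = with_ko = 0
    for row in reader:
        total += 1
        cog = row.get("COG_category") or "-"
        if cog != "-":
            cog_counts.update(cog)  # a gene may carry several one-letter categories
        if (row.get("GOs") or "-") != "-":
            with_go += 1
        if (row.get("KEGG_ko") or "-") != "-":
            with_ko += 1
    return {"genes": total, "cog": dict(cog_counts),
            "with_go": with_go, "with_ko": with_ko}


# Toy table (gene names and annotations invented for illustration).
sample = (
    "## comment line\n"
    "#query\tCOG_category\tGOs\tKEGG_ko\n"
    "gene_1\tE\tGO:0008150\tko:K00001\n"
    "gene_2\tEG\t-\t-\n"
    "gene_3\t-\t-\tko:K00002\n"
    "## 3 queries scanned\n"
)
summary = summarize_annotations(sample)
```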
GO stands for Gene Ontology[12], an internationally standardized classification system for describing gene function. GO is divided into three main categories: 1) Cellular Component: describes subcellular structures, locations and macromolecular complexes, such as the nucleolus, telomeres and the origin recognition complex; 2) Molecular Function: describes the functions of individual gene products, such as carbohydrate binding or ATP hydrolase activity; 3) Biological Process: describes the biological processes in which gene products are involved, such as mitosis or purine metabolism.
Annotation results of three categories of GO database are shown as follows:
Result files:
KEGG stands for Kyoto Encyclopedia of Genes and Genomes[13]. It is a database for the systematic analysis of the metabolic pathways of gene products and compounds in cells and of the functions of these gene products. It integrates genomic, molecular and biochemical data, and includes KEGG PATHWAY, KEGG DRUG, KEGG DISEASE, KEGG MODULE, KEGG GENES and KEGG GENOME, among others. The KO (KEGG Orthology) system links the various KEGG annotation systems together. KEGG has established a complete KO annotation system for functional annotation of the genome or transcriptome of newly sequenced species. See http://www.genome.jp/kegg/.
Plot the number of genes with annotation in KEGG database:
Result files:
COG, short for Clusters of Orthologous Groups of proteins[14], is a protein database created and maintained by NCBI. It is constructed by classifying the proteins encoded in complete genomes of bacteria, archaea and eukaryotes according to their phylogenetic relationships. A protein sequence can be assigned to a COG by sequence comparison. Each COG consists of orthologous sequences, so the function of an assigned sequence can be inferred. The COG database is divided into 26 functional classes; see https://www.ncbi.nlm.nih.gov/research/cog.
Plot of summary statistics:
Result files:
NCBI Taxonomy[15] is a curated classification and nomenclature for all organisms in the NCBI public sequence databases, including the names of all organisms that have nucleotide or protein sequences in those databases. See https://www.ncbi.nlm.nih.gov/taxonomy.
Plot of summary statistics:
Result files:
Secreted proteins are proteins that are synthesized inside the cell and secreted outside the cell to function; examples include some enzymes, antibodies and hormones. Many pathogenic factors and proteins related to small-molecule metabolism are secreted proteins, so secreted protein analysis is of great significance for the study of pathogenic and metabolically important bacteria. The N-terminus of a secreted protein carries a hydrophobic peptide of 15 to 35 amino acids, encoded at the 5’ end of the gene, which is called the signal peptide. It guides the nascent polypeptide chain across the membrane and is one of the markers of a secreted protein. Outside the signal peptide, a secreted protein has no hydrophobic transmembrane region. After the signal peptide has guided the protein across the membrane, signal peptidase cleaves the signal peptide at the corresponding site, completing the secretion of the mature protein.
Secreted proteins were therefore identified by combining signal peptide prediction with transmembrane helix prediction, using SignalP[16] and TMHMM[17] respectively. Protein-coding genes with a signal peptide but without a transmembrane helix outside the signal peptide region were selected as candidate secreted protein-coding genes.
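The selection rule can be written as a small predicate. This is a minimal sketch, assuming the two predictions have already been parsed into a record per protein; the `ProteinPrediction` class, its field names and the example values are hypothetical and do not correspond to the output formats of either tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ProteinPrediction:
    protein_id: str
    # Last residue of the predicted signal peptide (1-based); None if no signal peptide.
    signal_peptide_end: Optional[int]
    # Predicted transmembrane helices as (start, end) residue positions.
    tm_helices: List[Tuple[int, int]] = field(default_factory=list)


def is_candidate_secreted(p: ProteinPrediction) -> bool:
    """Signal peptide present, and no TM helix extending beyond the signal peptide."""
    if p.signal_peptide_end is None:
        return False
    # A helix lying entirely inside the signal peptide region is ignored,
    # since the hydrophobic signal peptide itself can be predicted as a helix.
    return all(end <= p.signal_peptide_end for _start, end in p.tm_helices)


# Toy predictions (invented for illustration only).
predictions = [
    ProteinPrediction("p1", 22),
    ProteinPrediction("p2", 25, [(5, 24)]),      # helix inside the signal peptide
    ProteinPrediction("p3", 20, [(150, 172)]),   # membrane protein
    ProteinPrediction("p4", None),               # no signal peptide
    ProteinPrediction("p5", 20, [(10, 30)]),     # helix extends past the signal peptide
]
candidates = [p.protein_id for p in predictions if is_candidate_secreted(p)]
```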
Result files:
VFDB is the database of Virulence Factors of Pathogenic Bacteria[18]. It is used for the study of pathogenic bacteria, chlamydia and mycoplasma. In addition to species information and the basic characteristics of virulence genes, it provides detailed descriptions of virulence gene functions and pathogenic mechanisms.
BLAST[19] was used to compare the nucleotide sequences of the coding genes of the target species against the VFDB database, and each matching gene was combined with the corresponding virulence factor annotation to obtain the annotation results.
Result files:
CARD is the Comprehensive Antibiotic Resistance Database[20]. It is a relatively recent resistance gene database with comprehensive information, a user-friendly interface, and timely updates and maintenance. The core of the database is the Antibiotic Resistance Ontology (ARO), which integrates sequences, antibiotic resistance, mechanisms of action, relationships between ARO terms and other information. Annotation against the database yields the names of antibiotic resistance-related genes, the classes of antibiotic they confer resistance to, and so on.
Resistance Gene Identifier (RGI[21]) was used to map the amino acid sequences of the target species to the CARD database (RGI built-in blastp, default evalue ≤ 1e-30). The resistance genes annotated in the database were then counted from the RGI results.
Result files:
MLST (multilocus sequence typing) is a bacterial typing method based on nucleotide sequence determination. For each locus, every distinct sequence is assigned an allele number in order of discovery. The allele numbers of a strain, in the specified locus order, form its allelic profile, which defines the sequence type (ST) of the strain. Each ST thus represents one unique combination of allele sequences. The relatedness of strains can be assessed by comparing STs: closely related strains have the same ST or differ at only a few loci, while unrelated strains differ at three or more loci.
MLST is performed with the tool mlst[22] based on the PubMLST database[23]. (MLST typing is not available for strains without a typing scheme in the PubMLST database.)
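The relationship between allelic profiles and sequence types can be sketched as a table lookup. Everything in the example is hypothetical: the locus names, the allele numbers and the ST numbers are invented, and a real scheme would be taken from the PubMLST definitions for the species in question.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical seven-locus scheme: allelic profile -> sequence type.
LOCI = ("locA", "locB", "locC", "locD", "locE", "locF", "locG")
SCHEME: Dict[Tuple[int, ...], int] = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 1, 2, 1, 1, 1, 1): 2,
    (5, 4, 3, 7, 2, 9, 6): 3,
}


def assign_st(profile: Sequence[int],
              scheme: Dict[Tuple[int, ...], int] = SCHEME) -> Optional[int]:
    """Return the ST for an allelic profile, or None if the profile is not in the scheme."""
    return scheme.get(tuple(profile))


def locus_differences(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of loci at which two allelic profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    return sum(1 for x, y in zip(a, b) if x != y)


strain_x = (1, 1, 1, 1, 1, 1, 1)
strain_y = (1, 1, 2, 1, 1, 1, 1)
strain_z = (5, 4, 3, 7, 2, 9, 6)
close = locus_differences(strain_x, strain_y)    # single-locus difference
distant = locus_differences(strain_x, strain_z)  # differs at every locus
```

Under the rule of thumb above, strains x and y would be considered closely related, while x and z would not.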
Result files:
Comparative genomics analysis of bacterial genomes studies the evolutionary relationships among different bacterial species or strains. To infer the phylogeny of the bacterial genomes, pan-genome analysis is performed with Roary[24] to generate the core genome alignment, which BacWGSpipe then uses to construct a maximum-likelihood phylogenetic tree with IQ-TREE[25]. The output tree can be viewed and annotated with any tree-viewing software, such as iTOL[26]. BacWGSpipe automatically generates annotation files for MLST, AMR and VF in a format supported by iTOL, which users can customize.
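As an illustration of what such an annotation file can look like, the sketch below writes a colour strip dataset that colours each leaf by its sequence type. It is an assumption-laden example: the layout follows the iTOL `DATASET_COLORSTRIP` template as understood here, the files actually produced by the pipeline may differ, and the function name, leaf names and colours are invented.

```python
from typing import Dict


def itol_colorstrip(label: str, leaf_values: Dict[str, str],
                    palette: Dict[str, str]) -> str:
    """Build the text of a tab-separated colour strip dataset."""
    lines = [
        "DATASET_COLORSTRIP",
        "SEPARATOR TAB",
        f"DATASET_LABEL\t{label}",
        "COLOR\t#000000",
        "DATA",
    ]
    for leaf, value in leaf_values.items():
        # Leaves whose value has no palette entry fall back to grey.
        color = palette.get(value, "#cccccc")
        lines.append(f"{leaf}\t{color}\t{value}")
    return "\n".join(lines) + "\n"


# Toy input: leaf names must match the tip labels of the tree.
annotation = itol_colorstrip(
    "MLST",
    {"strain_1": "ST1", "strain_2": "ST2", "strain_3": "ST9"},
    {"ST1": "#1f77b4", "ST2": "#ff7f0e"},
)
```

The returned text can be saved to a file and customized by editing the colours or the label before loading it alongside the tree.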
Result files:
[1] Wick, R R, and P Menzel. “Filtlong: quality filtering tool for long reads.” GitHub (2017). https://github.com/rrwick/Filtlong/
[2] Wick, Ryan R et al. “Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads.” PLoS computational biology vol. 13,6 e1005595. 8 Jun. 2017, doi:10.1371/journal.pcbi.1005595
[3] Seemann, Torsten. “Prokka: rapid prokaryotic genome annotation.” Bioinformatics (Oxford, England) vol. 30,14 (2014): 2068-9. doi:10.1093/bioinformatics/btu153
[4] Carattoli, Alessandra et al. “In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing.” Antimicrobial agents and chemotherapy vol. 58,7 (2014): 3895-903. doi:10.1128/AAC.02412-14
[5] Starikova, Elizaveta V et al. “Phigaro: high-throughput prophage sequence annotation.” Bioinformatics (Oxford, England) vol. 36,12 (2020): 3882-3884. doi:10.1093/bioinformatics/btaa250
[6] Liu, Meng et al. “ICEberg 2.0: an updated database of bacterial integrative and conjugative elements.” Nucleic acids research vol. 47,D1 (2019): D660-D665. doi:10.1093/nar/gky1123
[7] Puterová, Janka, and Tomáš Martínek. “digIS: towards detecting distant and putative novel insertion sequence elements in prokaryotic genomes.” BMC bioinformatics vol. 22,1 258. 20 May. 2021, doi:10.1186/s12859-021-04177-6
[8] Bertelli, Claire, and Fiona S L Brinkman. “Improved genomic island predictions with IslandPath-DIMOB.” Bioinformatics (Oxford, England) vol. 34,13 (2018): 2161-2167. doi:10.1093/bioinformatics/bty095
[9] Russel, Jakob et al. “CRISPRCasTyper: Automated Identification, Annotation, and Classification of CRISPR-Cas Loci.” The CRISPR journal vol. 3,6 (2020): 462-469. doi:10.1089/crispr.2020.0059
[10] Cantalapiedra, Carlos P et al. “eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale.” Molecular biology and evolution vol. 38,12 (2021): 5825-5829. doi:10.1093/molbev/msab293
[11] Huerta-Cepas, Jaime et al. “eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses.” Nucleic acids research vol. 47,D1 (2019): D309-D314. doi:10.1093/nar/gky1085
[12] Ashburner, M et al. “Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.” Nature genetics vol. 25,1 (2000): 25-9. doi:10.1038/75556
[13] Kanehisa, M, and S Goto. “KEGG: kyoto encyclopedia of genes and genomes.” Nucleic acids research vol. 28,1 (2000): 27-30. doi:10.1093/nar/28.1.27
[14] Tatusov, Roman L et al. “The COG database: an updated version includes eukaryotes.” BMC bioinformatics vol. 4 (2003): 41. doi:10.1186/1471-2105-4-41
[15] Federhen, Scott. “The NCBI Taxonomy database.” Nucleic acids research vol. 40,Database issue (2012): D136-43. doi:10.1093/nar/gkr1178
[16] Teufel, Felix et al. “SignalP 6.0 predicts all five types of signal peptides using protein language models.” Nature biotechnology vol. 40,7 (2022): 1023-1025. doi:10.1038/s41587-021-01156-3
[17] Krogh, A et al. “Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.” Journal of molecular biology vol. 305,3 (2001): 567-80. doi:10.1006/jmbi.2000.4315
[18] Chen, Lihong et al. “VFDB: a reference database for bacterial virulence factors.” Nucleic acids research vol. 33,Database issue (2005): D325-8. doi:10.1093/nar/gki008
[19] Camacho, Christiam et al. “BLAST+: architecture and applications.” BMC bioinformatics vol. 10 421. 15 Dec. 2009, doi:10.1186/1471-2105-10-421
[20] McArthur, Andrew G et al. “The comprehensive antibiotic resistance database.” Antimicrobial agents and chemotherapy vol. 57,7 (2013): 3348-57. doi:10.1128/AAC.00419-13
[21] Alcock, Brian P et al. “CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database.” Nucleic acids research vol. 48,D1 (2020): D517-D525. doi:10.1093/nar/gkz935
[22] Seemann, Torsten. “mlst.” GitHub. https://github.com/tseemann/mlst
[23] Jolley, Keith A et al. “Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications.” Wellcome open research vol. 3 124. 24 Sep. 2018, doi:10.12688/wellcomeopenres.14826.1
[24] Page, A J et al. “Roary: rapid large-scale prokaryote pan genome analysis.” Bioinformatics (Oxford, England) vol. 31,22 (2015): 3691-3693.
[25] Nguyen, L-T et al. “IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies.” Molecular biology and evolution vol. 32,1 (2015): 268-274.
[26] Letunic, I, and P Bork. “Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation.” Nucleic acids research vol. 49,W1 (2021): W293-W296.